1. Citation
This dataset is publicly available for research. The details are described in [Cortez et al., 2009]. Please include this citation if you plan to use this database:
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
Available at: [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016 [Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf [bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib
2. About dataset
The scope of this analysis is to understand how various physicochemical parameters relate to the quality ratings of both red and white wine. The data sets used for the analysis were downloaded from https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityReds.csv and https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityWhites.csv
3. Number of Instances:
red wine - 1599; white wine - 4898.
4. Number of Attributes:
11 + output attribute
5. Attribute information:
For more information, read [Cortez et al., 2009].
Input variables (based on physicochemical tests):
1 - fixed acidity (tartaric acid - g / dm^3)
2 - volatile acidity (acetic acid - g / dm^3)
3 - citric acid (g / dm^3)
4 - residual sugar (g / dm^3)
5 - chlorides (sodium chloride - g / dm^3)
6 - free sulfur dioxide (mg / dm^3)
7 - total sulfur dioxide (mg / dm^3)
8 - density (g / cm^3)
9 - pH
10 - sulphates (potassium sulphate - g / dm^3)
11 - alcohol (% by volume)
Output variable (based on sensory data):
12 - quality (score between 0 and 10)
6. Description of attributes:
1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
2 - volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste
3 - citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines
4 - residual sugar: the amount of sugar remaining after fermentation stops, it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet
5 - chlorides: the amount of salt in the wine
6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
7 - total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
8 - density: the density of wine is close to that of water depending on the percent alcohol and sugar content
9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
11 - alcohol: the percent alcohol content of the wine
# Packages used in this EDA
library(ggplot2)
library(gridExtra)
## Loading required package: grid
library(GGally)
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:GGally':
##
## nasa
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(psych)
##
## Attaching package: 'psych'
##
## The following object is masked from 'package:ggplot2':
##
## %+%
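The output that follows describes a single data frame holding both wines, with a `color` column. The loading step is not shown in this report, so the following is only a sketch of one way to build such a frame, assuming each CSV is read separately and the two are stacked with `rbind`. The tiny `red` and `white` frames are made-up stand-ins so the snippet runs without a download.

```r
# Made-up stand-ins for the two CSV files. In the real analysis these would
# come from read.csv() on the two files named in section 2 (assumption: the
# report does not show its loading code).
red <- data.frame(
  X = 1:2,
  alcohol = c(9.4, 9.8),
  quality = c(5L, 5L)
)
white <- data.frame(
  X = 1:3,
  alcohol = c(8.8, 9.5, 10.1),
  quality = c(6L, 6L, 6L)
)

# Tag each frame with its wine color, then stack them row-wise.
red$color <- "red"
white$color <- "white"
wine <- rbind(red, white)

dim(wine)          # rows = nrow(red) + nrow(white); columns include color
table(wine$color)  # number of samples per color
```

Because `X` is a per-file row index, it restarts at 1 for each color in the combined frame.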
## [1] 6497 14
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality" "color"
## 'data.frame': 6497 obs. of 14 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## $ color : chr "red" "red" "red" "red" ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.80 Min. :0.08 Min. :0.000
## 1st Qu.: 813 1st Qu.: 6.40 1st Qu.:0.23 1st Qu.:0.250
## Median :1650 Median : 7.00 Median :0.29 Median :0.310
## Mean :2044 Mean : 7.21 Mean :0.34 Mean :0.319
## 3rd Qu.:3274 3rd Qu.: 7.70 3rd Qu.:0.40 3rd Qu.:0.390
## Max. :4898 Max. :15.90 Max. :1.58 Max. :1.660
## residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. : 0.60 Min. :0.009 Min. : 1.0 Min. : 6
## 1st Qu.: 1.80 1st Qu.:0.038 1st Qu.: 17.0 1st Qu.: 77
## Median : 3.00 Median :0.047 Median : 29.0 Median :118
## Mean : 5.44 Mean :0.056 Mean : 30.5 Mean :116
## 3rd Qu.: 8.10 3rd Qu.:0.065 3rd Qu.: 41.0 3rd Qu.:156
## Max. :65.80 Max. :0.611 Max. :289.0 Max. :440
## density pH sulphates alcohol
## Min. :0.987 Min. :2.72 Min. :0.220 Min. : 8.0
## 1st Qu.:0.992 1st Qu.:3.11 1st Qu.:0.430 1st Qu.: 9.5
## Median :0.995 Median :3.21 Median :0.510 Median :10.3
## Mean :0.995 Mean :3.22 Mean :0.531 Mean :10.5
## 3rd Qu.:0.997 3rd Qu.:3.32 3rd Qu.:0.600 3rd Qu.:11.3
## Max. :1.039 Max. :4.01 Max. :2.000 Max. :14.9
## quality color
## Min. :3.00 Length:6497
## 1st Qu.:5.00 Class :character
## Median :6.00 Mode :character
## Mean :5.82
## 3rd Qu.:6.00
## Max. :9.00
Observations from the summary
1. The alcohol content varies from 8.0 to 14.9 across the samples in the dataset.
2. The quality of the samples ranges from 3 to 9, with a median of 6 and a mean of 5.818.
3. The range of fixed acidity is wide, with a minimum of 3.8 and a maximum of 15.9.
4. The pH varies from 2.720 to 4.010, with a mean of 3.219 and a median of 3.210.
5. Mean residual sugar is 5.443 but the maximum is 65.800, indicating an outlier.
6. free.sulfur.dioxide has a mean of 30.53 and a maximum of 289.0.
Analysis of all the single variables using plots
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.80 6.30 6.80 6.85 7.30 14.20
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
Observation about fixed acidity of wine:
Fixed acidity, which makes up most of a wine's titratable acidity, comes from acids that either occur naturally in the grapes or are created through the fermentation process. Red wine seems to have higher fixed acidity than white wine, as can be seen from the min, mean and max values of fixed.acidity in the summaries above (white first, then red). (http://winemakersacademy.com/understanding-wine-acidity)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.08 0.23 0.29 0.34 0.40 1.58
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.120 0.390 0.520 0.528 0.640 1.580
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.080 0.210 0.260 0.278 0.320 1.100
The volatile.acidity distribution is slightly right-skewed: the mean is 0.3397 but the maximum is 1.58, which comes from the red wines. scale_x_log10 is therefore used to analyze it further.
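As a rough illustration of why a log10 scale helps with a right-skewed variable, the sketch below uses base R rather than ggplot2's `scale_x_log10`, and made-up values rather than the wine data. It compares the mean and median on the raw scale, then bins the log10-transformed values.

```r
# Made-up right-skewed values standing in for wine$volatile.acidity.
volatile_acidity <- c(0.12, 0.20, 0.26, 0.30, 0.40, 1.58)

# On the raw scale the single large value pulls the mean above the median.
mean(volatile_acidity)
median(volatile_acidity)

# Bin the log10-transformed values; plot = FALSE returns the bin counts
# without drawing, so the sketch also runs in a non-interactive session.
log_hist <- hist(log10(volatile_acidity), plot = FALSE)
log_hist$counts
```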
Observation about Volatile acidity of wine:
The majority of volatile.acidity values seem to lie between 0.23 and 0.78. Our palates are quite sensitive to the presence of volatile acids, and for that reason their concentrations should be as low as possible.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.250 0.310 0.319 0.390 1.660
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.270 0.320 0.334 0.390 1.660
The citric.acid distribution is slightly skewed, so scale_x_log10 is used to analyze it further. The skew may be due to outliers, even though the mean is 0.271 for red and 0.334 for white.
Observation about Citric acidity of wine:
The summary shows a minimum of 0.0, and the graph shows between 0 and 250 wines at 0.0, so I wanted to see how many were either not reported or had a value of 0.
## [1] 151
151 observations had a value of 0.
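A count like the one above can be obtained by summing a logical comparison. The sketch below uses made-up values in place of `wine$citric.acid`. It also shows a side effect worth keeping in mind: `log10(0)` is `-Inf`, so zero-valued rows cannot be placed on a log10 axis.

```r
# Made-up values standing in for wine$citric.acid.
citric_acid <- c(0, 0, 0.04, 0.56, 0.31, 0.26)

# Number of observations that are exactly zero.
n_zero <- sum(citric_acid == 0)

# log10(0) is -Inf, so the same rows become non-finite after the transform.
log_citric <- log10(citric_acid)
n_nonfinite <- sum(!is.finite(log_citric))

n_zero
n_nonfinite
```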
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.60 1.80 3.00 5.44 8.10 65.80
##
## 0.6 0.7 0.8 0.9 0.95 1 1.05 1.1 1.15 1.2 1.25 1.3
## 2 7 25 39 4 93 1 146 3 187 3 147
## 1.35 1.4 1.45 1.5 1.55 1.6 1.65 1.7 1.75 1.8 1.85 1.9
## 2 184 4 142 2 165 2 99 1 99 3 59
## 1.95 2 2.05 2.1 2.2 2.25 2.3 2.35 2.4 2.5 2.6 2.65
## 2 79 1 51 56 2 42 1 41 40 33 1
## 2.7 2.8 2.85 2.9 3 3.1 3.15 3.2 3.3 3.4 3.5 3.6
## 38 36 1 25 17 17 1 28 23 13 31 22
## 3.7 3.75 3.8 3.85 3.9 3.95 4 4.1 4.2 4.25 4.3 4.35
## 12 2 21 3 17 3 19 17 31 2 19 1
## 4.4 4.45 4.5 4.55 4.6 4.7 4.75 4.8 4.85 4.9 5 5.1
## 14 3 33 2 40 29 5 38 1 35 43 28
## 5.15 5.2 5.25 5.3 5.35 5.4 5.45 5.5 5.55 5.6 5.7 5.8
## 2 29 4 17 2 23 2 13 1 16 30 23
## 5.85 5.9 5.95 6 6.1 6.2 6.3 6.35 6.4 6.5 6.55 6.6
## 2 19 1 23 21 31 39 1 34 26 1 30
## 6.65 6.7 6.75 6.8 6.85 6.9 6.95 7 7.05 7.1 7.2 7.25
## 3 25 1 28 6 20 1 31 2 36 29 2
## 7.3 7.35 7.4 7.45 7.5 7.6 7.7 7.75 7.8 7.85 7.9 7.95
## 19 2 40 1 30 29 34 2 41 1 32 1
## 8 8.1 8.15 8.2 8.25 8.3 8.4 8.45 8.5 8.55 8.6 8.65
## 32 34 1 36 2 31 13 1 24 1 27 1
## 8.7 8.75 8.8 8.9 8.95 9 9.05 9.1 9.15 9.2 9.25 9.3
## 18 2 22 23 1 18 1 17 2 22 2 11
## 9.4 9.5 9.55 9.6 9.65 9.7 9.8 9.85 9.9 10 10.05 10.1
## 10 9 1 18 4 22 16 3 18 18 3 14
## 10.2 10.3 10.4 10.5 10.55 10.6 10.65 10.7 10.8 10.9 11 11.1
## 23 16 25 16 1 22 1 26 17 11 19 18
## 11.2 11.25 11.3 11.4 11.45 11.5 11.6 11.7 11.75 11.8 11.9 11.95
## 18 2 12 14 1 11 15 8 4 35 16 3
## 12 12.05 12.1 12.15 12.2 12.3 12.4 12.5 12.55 12.6 12.7 12.75
## 16 1 21 4 15 13 19 16 2 16 16 1
## 12.8 12.85 12.9 13 13.1 13.15 13.2 13.3 13.4 13.5 13.55 13.6
## 25 4 25 19 23 1 13 16 7 10 3 12
## 13.65 13.7 13.8 13.9 14 14.05 14.1 14.15 14.2 14.3 14.35 14.4
## 4 21 8 18 16 1 4 1 20 17 3 17
## 14.45 14.5 14.55 14.6 14.7 14.75 14.8 14.9 14.95 15 15.1 15.15
## 3 17 3 13 14 2 12 14 2 13 7 1
## 15.2 15.25 15.3 15.4 15.5 15.55 15.6 15.7 15.75 15.8 15.9 16
## 6 1 9 17 11 6 14 9 1 6 2 10
## 16.05 16.1 16.2 16.3 16.4 16.45 16.5 16.55 16.6 16.65 16.7 16.75
## 6 2 7 7 5 1 3 1 2 5 5 2
## 16.8 16.85 16.9 16.95 17 17.05 17.1 17.2 17.3 17.35 17.4 17.45
## 4 4 3 3 1 1 5 9 14 1 2 2
## 17.5 17.55 17.6 17.7 17.75 17.8 17.85 17.9 17.95 18 18.05 18.1
## 8 3 2 1 4 13 5 2 3 2 3 6
## 18.15 18.2 18.3 18.35 18.4 18.5 18.6 18.75 18.8 18.9 18.95 19.1
## 8 3 2 4 1 1 1 4 3 1 3 1
## 19.25 19.3 19.35 19.4 19.45 19.5 19.6 19.8 19.9 19.95 20.15 20.2
## 3 4 1 2 3 2 1 4 1 3 1 2
## 20.3 20.4 20.7 20.8 22 22.6 23.5 26.05 31.6 65.8
## 1 1 2 2 2 1 1 2 2 1
##
## 0.9 1.2 1.3 1.4 1.5 1.6 1.65 1.7 1.75 1.8 1.9 2 2.05 2.1 2.15
## 2 8 5 35 30 58 2 76 2 129 117 156 2 128 2
## 2.2 2.25 2.3 2.35 2.4 2.5 2.55 2.6 2.65 2.7 2.8 2.85 2.9 2.95 3
## 131 1 109 1 86 84 1 79 1 39 49 1 24 1 25
## 3.1 3.2 3.3 3.4 3.45 3.5 3.6 3.65 3.7 3.75 3.8 3.9 4 4.1 4.2
## 7 15 11 15 1 2 8 1 4 1 8 6 11 6 5
## 4.25 4.3 4.4 4.5 4.6 4.65 4.7 4.8 5 5.1 5.15 5.2 5.4 5.5 5.6
## 1 8 4 4 6 2 1 3 1 5 1 3 1 8 6
## 5.7 5.8 5.9 6 6.1 6.2 6.3 6.4 6.55 6.6 6.7 7 7.2 7.3 7.5
## 1 4 3 4 4 3 2 3 2 2 2 1 1 1 1
## 7.8 7.9 8.1 8.3 8.6 8.8 8.9 9 10.7 11 12.9 13.4 13.8 13.9 15.4
## 2 3 2 3 1 2 1 1 1 2 1 1 2 1 2
## 15.5
## 1
There is an outlier at around 65; the majority of values are between 0.6 and 21.
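The outlier here was spotted by eye from the table. One common convention for flagging such values, not necessarily the one the author used, is the 1.5 x IQR rule. A minimal sketch on made-up values:

```r
# Made-up values standing in for wine$residual.sugar.
residual_sugar <- c(1.2, 1.8, 3.0, 5.4, 8.1, 65.8)

# Quartiles and interquartile range.
q1 <- quantile(residual_sugar, 0.25, names = FALSE)
q3 <- quantile(residual_sugar, 0.75, names = FALSE)
iqr <- q3 - q1

# 1.5 x IQR fences: values outside them are flagged as outliers.
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
outliers <- residual_sugar[residual_sugar < lower_fence |
                           residual_sugar > upper_fence]
outliers
```

On these toy values only 65.8 falls outside the fences.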
Observation about residual sugar of wine:
White wine's residual.sugar extends to about 20, with an outlier at 65.8, whereas red wine's residual.sugar mostly stays below about 9 (maximum 15.5). So some of the white wines seem to be sweeter than the red wines. (http://wine.about.com/od/wineandhealth/qt/Which-Wine-Has-The-Least-Sugar.htm)
As we can see from the tabulated red and white residual.sugar data, a lot of wines in the sample are dry red and white wines. But some of the white wines seem to be off-dry wines, where the residual.sugar falls between 10 and 30 grams, and some have the sweetness of a champagne (6 to 20 grams).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.009 0.038 0.047 0.056 0.065 0.611
Chloride levels seem to be skewed: the mean is 0.05603 but the maximum is 0.61100, so a log10 scale is used to analyze them further.
Observation about Chlorides in wine:
Few white wines have lesser chloride levels. There are some outliers for red wine chloride levels.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 17.0 29.0 30.5 41.0 289.0
The free sulfur dioxide data seems to be skewed, so log10 is used to analyze it further.
Observation about free sulfur dioxide in wine:
More white wines have higher levels of free sulfur dioxide. There is an outlier for white wine at 289.0.
This may be because of the following reason: (http://www.morethanorganic.com/sulphur-in-the-bottle) Red wines do not need any added sulphur dioxide because they naturally contain anti-oxidants, acquired from their skins and stems during fermentation. Conventional winemakers add some anyway. White wines and rosés do not contain natural anti-oxidants because they are not left in contact with their skins after crushing. For this reason they are more prone to oxidation and tend to be given larger doses of sulphur dioxide.
(http://en.wikipedia.org/wiki/Winemaking)
Also, many of our white wines were sweeter than the red wines. Sweet wines or off-dry wines are made by arresting fermentation before all sugar has been converted into ethanol and allowing some residual sugar to remain. This can be done by chilling the wine and adding sulphur and other allowable additives to inhibit yeast activity, or by sterile filtering the wine to remove all yeast and bacteria.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6 77 118 116 156 440
The total sulfur dioxide data seems to be skewed, so log10 is used to analyze it further.
Observation about total sulfur dioxide in wine:
More white wines have higher levels of total sulfur dioxide, just as with free sulfur dioxide. There are some outliers for white wine around 350.0.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.987 0.992 0.995 0.995 0.997 1.040
The density data seems to be skewed, so log10 is used to analyze it further.
Observation about density in wine:
There is an outlier at 1.03911 and others between 1.00911 and 1.01111.
Otherwise, most wines have a density between 0.987 and 1.0031.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.72 3.11 3.21 3.22 3.32 4.01
Observation about pH in wine:
Most wines in our sample have a pH in the range of 3 to 3.5.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.220 0.430 0.510 0.531 0.600 2.000
Observation about sulphates in wine:
There are some gaps in the data: either no data with those sulphate values was gathered, or wines do not have those sulphate values.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.0 9.5 10.3 10.5 11.3 14.9
Observation about Alcohol in wine:
Both red and white wines have the same alcohol distribution pattern.
The peak is around 9.5 for both red and white wine.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 5.00 6.00 5.82 6.00 9.00
## bad average good
## 246 4974 1277
Observation about Quality of wine:
The distribution of wine quality appears to have the shape of a normal distribution, peaking at quality 5 and 6.
I also created a new variable, quality rating, which classifies the wines into Bad, Average and Good buckets based on their quality. The majority fell in the Average bucket.
Did you create any new variables from existing variables in the dataset?
Created a new variable quality_rating, which classifies the wines into Bad, Average and Good buckets based on the quality of the wine.
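The report does not show how quality_rating was computed. The following is a minimal sketch of one way to bucket scores with `cut()`. The cutoffs (below 5 = bad, 5 to 6 = average, 7 and above = good) are an assumption, not taken from the source, and the scores are made up.

```r
# Made-up quality scores standing in for wine$quality.
quality <- c(3, 4, 5, 6, 6, 7, 8, 9)

# Assumed cutoffs (not stated in the source):
#   quality <= 4 -> bad, 5 or 6 -> average, >= 7 -> good.
# cut() uses right-closed intervals by default: (-Inf, 4], (4, 6], (6, Inf).
quality_rating <- cut(
  quality,
  breaks = c(-Inf, 4, 6, Inf),
  labels = c("bad", "average", "good")
)

table(quality_rating)
```

Using a factor keeps the buckets in a meaningful order (bad, average, good) when they are tabulated or used for faceting.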
Of the features you investigated, were there any unusual distributions?
The density distribution of white wine is bimodal, while that of red wine looks normal.
Did you perform any operations on the data to tidy, adjust, or change the form of the data?
I did not tidy the data but to be able to analyze some of the skewed data I had to use log10.
pairs(wine)
Reviewing the pairs plot for strong correlations, we see several pairs that can be analyzed further.
We can ignore the correlation between total.sulfur.dioxide and free.sulfur.dioxide (corr = 0.721), as free SO2 is part of total SO2.
Since the correlation between total.sulfur.dioxide and residual.sugar is higher, we also ignore the correlation between free.sulfur.dioxide and residual.sugar (corr = 0.403).
Ref.:http://www.inside-r.org/packages/cran/psych/docs/pairs.panels
A few of the top correlation pairs are:
alcohol vs. density (corr = -0.69)
density vs. residual.sugar (corr = 0.55)
total.sulfur.dioxide vs. residual.sugar (corr = 0.50)
density vs. fixed.acidity (corr = 0.46)
quality vs. alcohol (corr = 0.44)
total.sulfur.dioxide vs. volatile.acidity (corr = -0.41)
chlorides vs. sulphates (corr = 0.40)
chlorides vs. volatile.acidity (corr = 0.38)
citric.acid vs. fixed.acidity (corr = -0.38)
density vs. chlorides (corr = 0.36)
alcohol vs. residual.sugar (corr = -0.36)
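The list above was read off the pairs plot. A ranked list of this kind can also be produced directly from a correlation matrix. The sketch below is one way to do that in base R on a made-up data frame. The column names mirror the wine data, but the values are invented, so the numbers it prints are not the report's.

```r
# Made-up numeric data standing in for the wine measurements.
toy <- data.frame(
  alcohol = c(9.4, 9.8, 10.5, 11.2, 12.0, 12.8),
  density = c(0.9980, 0.9970, 0.9962, 0.9951, 0.9935, 0.9920),
  residual.sugar = c(1.9, 2.6, 6.1, 1.8, 2.0, 1.2)
)

# Pearson correlation matrix of all numeric columns.
cors <- cor(toy)

# Keep each unordered pair once by taking the upper triangle.
idx <- which(upper.tri(cors), arr.ind = TRUE)
pairs_df <- data.frame(
  var1 = rownames(cors)[idx[, "row"]],
  var2 = colnames(cors)[idx[, "col"]],
  corr = cors[idx]
)

# Sort by absolute correlation, strongest first.
pairs_df <- pairs_df[order(-abs(pairs_df$corr)), ]
print(pairs_df)
```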
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 5.00 6.00 5.82 6.00 9.00
Function to generate graphs to analyze the correlation of different elements with the quality factor
The quality of wine vs. alcohol is displayed using box plots, as alcohol plays an important role in the microbial stabilization of both red and white wine.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.0 9.5 10.3 10.5 11.3 14.9
In order to analyze the relationship between alcohol and quality, let us see how the alcohol values are distributed across varying quality and how it varies with quality.
## 3 4 5 6 7 8 9
## 10.215 10.180 9.838 10.588 11.386 11.679 12.180
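A per-quality table of means like the one above can be produced with `tapply()`. This is a sketch on made-up values. The report does not show which call it used, so treat the choice of `tapply()` as an assumption.

```r
# Made-up samples standing in for wine$quality and wine$alcohol.
toy <- data.frame(
  quality = c(5, 5, 6, 6, 7, 7),
  alcohol = c(9.4, 9.8, 10.2, 10.8, 11.4, 12.0)
)

# Mean alcohol within each quality level; names are the quality levels.
mean_by_quality <- tapply(toy$alcohol, toy$quality, mean)
mean_by_quality
```

Swapping `mean` for `median` gives the per-level medians that the box plots display.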
Visually alcohol by quality levels along with median and mean is:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.0 9.5 10.3 10.5 11.3 14.9
Observation about Alcohol vs. Quality of Wine:
For both red and white wine, samples with quality above the mean of 5.818 tend to have alcohol content above the mean of 10.49.
In our sample only some white wines have the highest quality of 9.
The quality of wine vs. residual sugar is displayed using box plots, as sugar is an essential component in the production of wine.
During alcoholic fermentation, yeast feeds on the sugar found in grape juice and converts it to ethyl alcohol, or ethanol, and carbon dioxide. The amount of sugar fermented determines the wine’s alcohol level and the amount of residual sugar left in the wine.
Ref: https://winemakermag.com/501-measuring-residual-sugar-techniques
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.60 1.80 3.00 5.44 8.10 65.80
In order to analyze the relationship between residual.sugar and quality, let us see how the residual.sugar values are distributed across varying quality and how it varies with quality.
## 3 4 5 6 7 8 9
## 5.140 4.154 5.804 5.550 4.732 5.383 4.120
Visually residual.sugar by quality levels along with median and mean is:
Observation about residual.sugar vs. Quality of Wine:
Red wine quality does not appear to be affected by residual.sugar, and red wine has less residual.sugar.
The white wines of the highest quality (9) have residual.sugar below the mean residual.sugar value.
White wine has higher residual.sugar than red wine.
Interesting fact: A winemaker who wishes to make a wine with high levels of residual sugar (like a dessert wine) may stop fermentation early, either by dropping the temperature of the must to stun the yeast or by adding a high level of alcohol (like brandy) to the must to kill off the yeast and create a fortified wine.
Ref.: http://en.wikipedia.org/wiki/Fermentation_in_winemaking
The quality of wine vs. chlorides, which act as preserving agents in liquid enzyme preparations, which in turn are important for the microbiological stability of wines.
Ref.: http://www.westchesterwinemakers.com/2010/06/03/enzymes-in-winemaking-do-we-use-them-damm-straight-we-do/
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.009 0.038 0.047 0.056 0.065 0.611
In order to analyze the relationship between chlorides and quality, let us see how the chloride values are distributed across varying quality and how it varies with quality.
## 3 4 5 6 7 8 9
## 0.07703 0.06006 0.06467 0.05416 0.04527 0.04112 0.02740
Visually chlorides by quality levels along with median and mean are:
Observation about Chlorides vs. Quality of Wine:
For both red and white wine, samples with lower chloride levels tend to have higher quality.
Red wine has more chloride content than white wine; white wine's chloride content is mostly below the mean chloride level.
The quality of wine vs. density using box plots.
It is generally used as a measure of the conversion of sugar to alcohol. The must, with sugar but no alcohol, has a high density. The finished wine has less sugar but lots of alcohol and thus has a lower density. The difference between the two is used to calculate the alcohol content.
https://answers.yahoo.com/question/index?qid=20140527020443AALJISW
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.987 0.992 0.995 0.995 0.997 1.040
In order to analyze the relationship between density and quality, let us see how the density values are distributed across varying quality and how they vary with quality.
## 3 4 5 6 7 8 9
## 0.9957 0.9948 0.9958 0.9946 0.9931 0.9925 0.9915
Visually density by quality levels along with median and mean is:
Observation about Density vs. Quality of Wine:
For both red and white wine, samples with lower density tend to have higher quality.
Red wine is denser than white wine.
In our sample, most white wines fall in the quality range 4.5 to 7.5; only a few have a high quality of 8.
In our sample, the majority of red wines are between quality 4.5 and 6.5; only some are at quality level 7 and very few at 8.
As the SO2 vs. quality, sulphates vs. quality and fixed.acidity vs. quality graphs show:
The quality of wine varies from 4.5 to 7.5 for both red and white wine, irrespective of the SO2, sulphates or fixed.acidity level.
Very few white wines are of high quality, and these elements seem to have no impact on quality.
Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?
Alcohol has the strongest correlation with wine quality (0.44): as alcohol content increases, wine quality tends to increase.
Red wine quality does not appear to be affected by residual.sugar, and red wine has less residual.sugar. The white wines of the highest quality (9) have residual.sugar below the mean residual.sugar value.
White wine has higher residual.sugar than red wine.
For both red and white wine, samples with lower chloride levels tend to have higher quality.
For both red and white wine, samples with lower density tend to have higher quality.
Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?
The relationship between some elements varies with the color of wine:
density vs. fixed.acidity,
chlorides vs. sulphates,
fixed.acidity vs. citric.acid and
residual.sugar vs. alcohol.
What was the strongest relationship you found?
Alcohol vs. quality is the strongest relationship I found for both wines in the given data.
By plotting against each other and faceted by wine quality_rating:
The correlation between alcohol and density is strong and negative for both white and red wines.
As the alcohol level increases, the density of wine decreases.
The correlation between residual.sugar and density is strong and positive for white and red wines.
As residual.sugar increases, density also increases, as we can see from the average and good quality red and white wines.
The density of wine is close to that of water; dry wine is slightly less dense and sweet wine slightly more dense. Water has a density of 1.000 kg/L, ethanol 0.789 kg/L, and sugar 1.587 kg/L.
So a wine with 13% alcohol by volume and 0.5% sugar by volume has a density of approximately
0.13 x 789 + 0.005 x 1587 + 0.865 x 1000 = 975.5 g/L (about 0.9755 kg/L)
(Ref.:http://www.answers.com/Q/What_is_the_density_of_wine)
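The arithmetic above is a volume-weighted average of component densities. A small sketch makes it reusable for other alcohol and sugar fractions. The function name `mixture_density` is hypothetical, and the calculation assumes volumes simply add on mixing, which is a simplification.

```r
# Volume-weighted density of a water / ethanol / sugar mixture, in kg/L.
# Assumes ideal mixing (volumes add), which is only an approximation.
# mixture_density is a hypothetical helper, not part of the original analysis.
mixture_density <- function(alcohol_frac, sugar_frac) {
  water_frac <- 1 - alcohol_frac - sugar_frac
  alcohol_frac * 0.789 + sugar_frac * 1.587 + water_frac * 1.000
}

# 13% alcohol and 0.5% sugar by volume, as in the worked example.
wine_density <- mixture_density(0.13, 0.005)
wine_density  # about 0.9755 kg/L, i.e. about 975.5 g/L
```

Raising the alcohol fraction lowers the result and raising the sugar fraction raises it, which matches the signs of the two correlations with density reported above.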
The correlation between residual.sugar and total.sulfur.dioxide is weak for white and red wine.
The correlation between density and fixed.acidity is strong and positive for red wines and absent for white wines.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 5.00 6.00 5.82 6.00 9.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6 77 118 116 156 440
There is no correlation between volatile.acidity and total.sulfur.dioxide for red and white wines.
The correlation between chlorides and sulphates is strong and positive for red wines and absent for white wines.
There is no correlation between chlorides and volatile.acidity for red and white wines.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.250 0.310 0.319 0.390 1.660
The correlation between fixed.acidity and citric.acid is strong for red wines and for white wines the correlation between fixed.acidity and citric.acid weakens as it goes from bad to good quality rating.
The correlation between chlorides and density is strong for red and white wines.
The correlation between alcohol and residual.sugar is strong for white wines and weak to none for red wines.
During alcoholic fermentation, yeast feeds on the sugar found in grape juice and converts it to ethyl alcohol, or ethanol, and carbon dioxide. The amount of sugar fermented determines the wine’s alcohol level and the amount of residual sugar left in the wine. (Ref.:https://winemakermag.com/501-measuring-residual-sugar-techniques)
So as residual.sugar level increases alcohol level decreases for white wine.
| Element pair | Red | White | Corr |
|---|---|---|---|
| alcohol vs. density | S | S | 0.69 |
| residual.sugar vs. density | S | S | 0.55 |
| residual.sugar vs. total.sulfur.dioxide | W | W | 0.50 |
| density vs. fixed.acidity | S | N | 0.46 |
| quality vs. alcohol | S | S | 0.44 |
| volatile.acidity vs. total.sulfur.dioxide | N | N | 0.41 |
| chlorides vs. sulphates | S | N | 0.40 |
| volatile.acidity vs. chlorides | N | N | 0.38 |
| fixed.acidity vs. citric.acid | S | W | 0.38 |
| chlorides vs. density | S | S | 0.36 |
| residual.sugar vs. alcohol | N | S | 0.36 |
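In the table above, S = strong, W = weak and N = none, and Corr is the magnitude of the correlation over all wines. From this it is evident that the following correlations depend on the color of the wine: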
density vs. fixed.acidity,
chlorides vs. sulphates,
fixed.acidity vs. citric.acid and
residual.sugar vs. alcohol.
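Color-dependent correlations like these can be checked by computing the correlation separately within each color. The sketch below does this with `split()` and `sapply()` on made-up data, constructed so that the red group is perfectly correlated and the white group is uncorrelated. None of the numbers come from the wine data.

```r
# Made-up data: the red rows are perfectly linear, the white rows are not.
toy <- data.frame(
  color = rep(c("red", "white"), each = 4),
  fixed.acidity = c(6, 7, 8, 9, 6, 7, 8, 9),
  density = c(0.994, 0.995, 0.996, 0.997, 0.995, 0.993, 0.993, 0.995)
)

# Correlation within each color.
cor_by_color <- sapply(
  split(toy, toy$color),
  function(d) cor(d$fixed.acidity, d$density)
)

# Correlation over the pooled sample, for comparison.
cor_all <- cor(toy$fixed.acidity, toy$density)

cor_by_color
cor_all
```

The pooled value sits between the two group values, which is why a combined correlation can hide a relationship that holds for only one color.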
Since the number of red wines is about one third of the number of white wines in the sample, the correlations for the combined sample follow the white wines rather than the red.
So below we analyze some of the key correlations for red wine.
For red wine, the top correlations are between the following elements:
| Element pair | Corr |
|---|---|
| fixed.acidity vs pH | (-)0.68 |
| fixed.acidity vs citric.acid | 0.67 |
| fixed.acidity vs density | 0.67 |
| volatile.acidity vs citric.acid | (-)0.55 |
| citric.acid vs pH | (-)0.54 |
| density vs. alcohol | (-)0.50 |
##
## bad average good
## 63 1319 217
As you can see, the pH level decreases as acidity increases. The correlation between pH and fixed.acidity is negative and does not provide a clear relationship to quality.
Now let us look at the correlation between pH and fixed.acidity for good and bad quality wine
Wine of quality level 6 has a higher concentration between fixed.acidity level 6 and 10 and citric.acid level between 0 and 0.37.
**Further analyzing the correlation for bad and good quality wine**
As fixed.acidity increases there is an increase in the citric.acid level in red wine, maybe because citric acid is one of the fixed acids.
Quality level 7 has a higher content of citric.acid, indicating that higher quality red wines have more citric.acid in them.
Quality of red wine increases along with the increase in the concentration of fixed.acidity and density.
The correlation between volatile.acidity and citric.acid is negative; that is, as volatile.acidity increases, the citric.acid of red wine decreases.
The majority of the wines with high levels of citric acid are at quality level 7, and those with lower levels fall in the quality level 5 range.
This supports the previous theory that the level of citric.acid in red wine contributes to its quality.
While fixed.acidity has a positive impact on wine quality, volatile.acidity seems to have a negative impact on quality.
The pH and citric.acid correlation seems to have a positive impact on the quality of red wine.
Most good quality wines have a pH between 3 and 3.5, and citric.acid levels increase in good quality wine.
The majority of red wines with a quality of 7 have alcohol content above 10.
Quality of red wine is good when:
Fixed.acidity is less
Citric.acid is high
Alcohol is high
pH is between 3 and 3.5
For white wine, the top correlations are between the following elements:
| Element pair | Corr |
|---|---|
| residual.sugar vs density | 0.84 |
| density vs. alcohol | (-)0.78 |
| total.sulfur.dioxide vs density | 0.53 |
| residual.sugar vs alcohol | (-)0.45 |
| total.sulfur.dioxide vs alcohol | (-)0.45 |
| pH vs. fixed.acidity | (-)0.43 |
White wine quality is high when the density of the wine is low.
White wine quality is high when alcohol is high, and the correlation between alcohol and density is negative, which again confirms the above finding about density.
##
## 10 19 25 26 29 31 33 34 37 44 47 49
## 1 1 1 1 1 1 1 2 2 1 1 1
## 50 51 53 55 56 57 59 60 61 62 64 65
## 2 1 1 3 2 2 2 1 3 1 1 5
## 66 67 68 69 70 71 72 73 74 75 76 77
## 2 2 5 3 3 3 5 7 3 7 11 2
## 78 79 80 81 82 83 84 85 86 87 88 89
## 5 6 6 7 7 10 5 5 10 13 9 12
## 90 91 92 93 94 95 96 97 98 99 100 101
## 9 3 9 18 10 11 13 16 22 8 14 10
## 102 103 104 105 106 107 108 109 110 111 112 113
## 20 6 13 14 11 19 8 7 11 22 7 24
## 114 115 115.5 116 117 118 119 120 121 122 123 124
## 25 16 1 10 13 20 13 12 15 17 8 17
## 125 126 127 128 129 129.5 130 131 132 133 134 135
## 13 13 15 19 11 2 15 9 14 12 18 14
## 136 137 138 139 140 141 142 143 144 145 146 147
## 10 3 18 4 15 6 15 12 7 5 4 8
## 148 149 150 151 152 153 154 155 156 157 158 159
## 10 15 15 10 12 8 4 12 5 6 9 2
## 160 161 162 163 164 165 166 167 168 169 170 171
## 3 5 8 7 6 4 5 7 13 4 5 10
## 172 173 174 175 177 178 179 180 181 182 184 185
## 4 4 3 3 5 9 4 4 5 2 1 1
## 186 187 188 189 189.5 190 191 192 193 194 195 196
## 4 3 2 10 1 1 4 5 3 1 3 2
## 197 198 199 200 201 203 205 206 208 209 210 212
## 2 1 2 7 2 3 4 1 1 1 2 3
## 212.5 214 216 225 227 228 229 233 234.5 245 272 307.5
## 6 2 1 1 1 1 6 1 1 1 1 1
## 366.5 440
## 1 1
The majority of white wines with low density and total.sulfur.dioxide between 75 mg/L and 175 mg/L seem to have high quality.
(Ref.: http://www.practicalwinery.com/janfeb09/page5.htm) White wines have more total SO2 than red wines; very sweet wines, such as dessert and fortified wines, need more SO2.
White wine quality is higher when residual.sugar is below 10 and alcohol content is high. Further analyzing only the bad and good quality white wines.
The quality of white wine is high when total.sulfur.dioxide is < 250 and alcohol content is high.
The correlation of pH vs. fixed.acidity in relation to quality is inconclusive. All we can say is that wines with pH between 3 and 3.5 and a fixed.acidity level between 5 and 8 have good quality.
Quality of white wine is good when:
Density is less
residual.sugar is below 10 g
alcohol content is high
total.sulfur.dioxide is between 75 mg/L and 175 mg/L
Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?
The relationship between alcohol and density is -ve and strong which has a positive impact on the quality of wine.
In case of white wine the strongest correlation(+ve) is between residual.sugar and density.
In case of red wine the strongest correlation (-ve) was between fixed.acidity and pH.
Were there any interesting or surprising interactions between features?
The correlation between some of the components depended on the color of the wine (red or white).
Plot One
Description One:
Wort /ˈwɜrt/ is the liquid extracted from the mashing process during the brewing of beer or whisky. Wort contains the sugars that will be fermented by the brewing yeast to produce alcohol.
The density of a wort is largely dependent on the sugar content of the wort. During alcohol fermentation, yeast converts sugars into carbon dioxide and alcohol. The decline in the sugar content and the presence of ethanol (which is appreciably less dense than water) drop the density of the wort.
Ref.:http://en.wikipedia.org/wiki/Gravity_%28alcoholic_beverage%29
Density is generally used as a measure of the conversion of sugar to alcohol. The must, with sugar but no alcohol, has a high density. The finished wine has less sugar but lots of alcohol and thus has a lower density. The difference between the two is used to calculate the alcohol content.
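The idea of using the density difference to estimate alcohol can be sketched in a few lines of Python. The factor 131.25 is a common home-brewing rule of thumb, not something taken from this data set or from the cited reference, and the two gravity readings are made-up values. Treating density in g/cm^3 as specific gravity is also an approximation.

```python
def estimate_abv(original_gravity, final_gravity):
    """Rough alcohol-by-volume estimate (%) from the drop in specific gravity.

    Uses the rule-of-thumb factor 131.25; this is an approximation that is
    assumed here for illustration, not a formula from the report.
    """
    return (original_gravity - final_gravity) * 131.25


# Hypothetical readings: a sugary must before fermentation, a dry wine after.
abv = estimate_abv(original_gravity=1.090, final_gravity=0.995)
print(round(abv, 2))
```

The data set only contains the density of the finished wine, so this estimate cannot be reproduced from it; the sketch only illustrates why lower final density tends to go with higher alcohol.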
White wines seem to have lower density than red wines.
The graphs suggest that wines with low density tend to have higher quality.
Below we look at the impact of alcohol level on wine quality.
Plot Two
Description Two
The quality of wine (both red and white) seems to increase with the level of alcohol. In our sample, every quality level except 5 supports that pattern.
The graph below shows the correlation between alcohol and density in red and white wines in our sample.
Plot Three
For both red and white wine, alcohol and density have a strong negative correlation, i.e. as the alcohol level increases, the density of the wine decreases.
The correlation coefficient between alcohol and density is -0.69.
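For readers who want to see what a correlation coefficient like this measures, here is a minimal Python sketch of the Pearson correlation. It is not the code used for this report, and the five alcohol and density values are made up to show a negative relationship; they will not reproduce -0.69.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (covariance numerator).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable.
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up values for illustration only; not rows from the wine data.
alcohol = [9.0, 10.0, 11.0, 12.0, 13.0]
density = [0.998, 0.997, 0.995, 0.994, 0.992]

r = pearson(alcohol, density)
print(round(r, 3))
```

A value near -1 means density falls almost linearly as alcohol rises; a value like -0.69 means the same direction with more scatter.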
Now the impact of density and alcohol on quality_rating of wine can be depicted as
##
## bad average good
## 246 4974 1277
Description Three
Even though our graph and the data indicate that higher alcohol content and lower density contribute to a good quality wine, the correlation between quality and alcohol is not that strong (0.44).
To analyze that further, I plotted the relationship using quality_rating, which showed that lower density and higher alcohol content are associated with higher wine quality.
The table of quality_rating suggests a reason for the weaker correlation: the majority of our wine sample falls in (4,6], the average quality bucket.
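The bucketing behind quality_rating can be sketched as follows. This is a Python illustration, not the code used for this report. Only the (4,6] average bucket is stated in the text; treating scores of 4 and below as bad and scores above 6 as good is an assumption, and the list of scores is made up.

```python
from collections import Counter


def quality_rating(score):
    """Map a 0-10 quality score to a bucket.

    Only the (4, 6] "average" interval is stated in the report; the
    "bad" (<= 4) and "good" (> 6) edges are assumed here.
    """
    if score <= 4:
        return "bad"
    if score <= 6:
        return "average"
    return "good"


# Made-up scores for illustration only; not taken from the data set.
scores = [3, 4, 5, 5, 6, 6, 6, 7, 8]

counts = Counter(quality_rating(s) for s in scores)
print(counts["bad"], counts["average"], counts["good"])
```

When most scores land in one bucket, as in the table above, there is little variation in quality left for any single variable to explain, which is one plausible reason the correlation with alcohol looks weak.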
The link below gives the 5 key components of wine.
http://www.snooth.com/articles/five-key-wine-components-and-how-to-detect-them/?viewall=1
The wine data set contains information on both red and white wine. I started by understanding the individual variables in the data set by plotting graphs and by visiting websites to see what contribution each element makes.
Then I explored interesting questions and leads as I continued to make observations on plots. Eventually, I explored the quality of wine based on density and alcohol.
It is interesting that even though the graph shows that an increase in alcohol content is an indication of good quality wine, the correlation between quality and alcohol is not strong.
On further analysis, I realized that the majority of the sample falls in the average quality bucket, (4,6], and hence the correlation may not be a true reflection.
The reference quotes 250 mg/l of sulphur dioxide for white wines with residual sugar greater than 5 g/litre (moelleux wines), and 300 mg/l for liquoreux sweet wines (Ref.: http://en.wikipedia.org/wiki/White_wine)
According to http://winemakersacademy.com/importance-ph-wine-making/, pH is the backbone of any wine. Even though the data shows that the wines in our sample have pH in the range of 3 to 3.5, pH does not have a strong relation to the quality of wine, which was surprising to me.
The sample data positively reinforces the characteristics of the components in a white wine.
White wines and rosés do not contain natural anti-oxidants because they are not left in contact with their skins after crushing. For this reason they are more prone to oxidation and tend to be given larger doses of sulphur dioxide.
Also, many of our white wines were sweeter than the red wines. Sweet or off-dry wines are made by arresting fermentation before all the sugar has been converted into ethanol, allowing some residual sugar to remain. This can be done by chilling the wine and adding sulphur and other allowable additives to inhibit yeast activity, or by sterile filtering the wine to remove all yeast and bacteria.
During alcoholic fermentation, yeast feeds on the sugar found in grape juice and converts it to ethyl alcohol, or ethanol, and carbon dioxide. The amount of sugar fermented determines the wine’s alcohol level and the amount of residual sugar left in the wine.
For further analysis
The data should have more red wine samples so the analysis does not favor the characteristics of one wine over another.
It should also have additional fields for:
Type of wine, e.g. dry, off-dry, fortified or sparkling. In the current analysis I had to assume the type of each wine from the data and assessed the quality on that basis. Since certain types of wine should have components at certain levels, my analysis may not have been accurate.
Color of wine (as the color changes as wine ages).
Datasets were obtained from:
Technical documents referenced for plotting graphs and analysis
Links referenced
http://statistics.ats.ucla.edu/stat/r/dae/tobit.htm
http://www.inside-r.org/packages/cran/psych/docs/pairs.panels
http://www.r-project.org/other-docs.html
http://cran.r-project.org/manuals.html
http://www.r-bloggers.com/choosing-colour-palettes-part-ii-educated-choices/
Trying to understand the wine, its components and how its quality is determined
https://winemakermag.com/501-measuring-residual-sugar-techniques
http://en.wikipedia.org/wiki/Fermentation_in_winemaking
https://answers.yahoo.com/question/index?qid=20140527020443AALJISW
http://www.snooth.com/articles/five-key-wine-components-and-how-to-detect-them/?viewall=1
http://winemakersacademy.com/understanding-wine-acidity
http://wine.about.com/od/wineandhealth/qt/Which-Wine-Has-The-Least-Sugar.htm
http://www.morethanorganic.com/sulphur-in-the-bottle
http://en.wikipedia.org/wiki/Winemaking
http://www.answers.com/Q/What_is_the_density_of_wine
http://www.practicalwinery.com/janfeb09/page5.htm
http://en.wikipedia.org/wiki/Gravity_%28alcoholic_beverage%29
The main issue I had was understanding the components of wine, not from the given data but from the real process of wine making, so I read some articles to obtain that understanding. Some of the components described as important in wine making were not reflected in the given sample of data, so I tried to analyze them separately and then realized that the correlation of some of the components was color (red or white) dependent.